1.  Summary


We have been carrying out research on the task of representing, reasoning about, and learning about
network interruptions. We have developed a distributed system in which each node in the
network monitors itself in real time and develops a model of its normal traffic behavior.

When a connection error occurs, the originating node sends an error report to a limited set of
relevant nodes. Each node tracks the error messages that pass through the network and,
in conjunction with its internal model of the network, infers the likely location of the network
interruption with a probabilistic model.

To test our system, we have developed a network simulator that automatically creates networks of
arbitrary size and generates traffic according to a specified probability distribution. The
number of end nodes per major hub is a parameter that can be set from 10 to 1000. We injected
faults into this network and monitored the ability of individual nodes to detect the fault. From
experimental runs with the simulator we have observed the following.

•  Local diagnosis of distant connection problems is feasible.

•  The communication overhead to send error messages is small compared with regular
traffic.

•  Nodes in the network do not have equal ability to infer problems correctly. The nodes
tend to “specialize” and are better able to detect errors in nodes with which they typically
exchange more traffic than average.

•  The memory overhead to implement this system is large but does not grow as the
simulation runs. The cost is also distributed across the network.


We believe future research on this topic should concentrate on extending the representational
power of the models used to represent traffic flows, with the goal of detecting a wider variety of
connection problems.


2.  Introduction 

In this project, we created a simple demonstration system to test the feasibility of learning in the
presence of distributed information (as would be the case for a knowledge-plane construct).
Specifically, we examined the following problem.

Given a network of nodes that can monitor themselves and communicate, build a
distributed system that automatically detects connection errors and propagates this
information as needed.

We believe a distributed system is necessary and even advantageous in many respects. Although
a centralized solution is possible, it is less desirable because it requires communication with a
central node, which may not be accessible if there are network problems. Furthermore, a
distributed system will spread out the cost of implementation. From this study we hope to
achieve three things.

1.  Demonstrate that distributed diagnosis is feasible.

2.  Gain a qualitative understanding of the issues involved in constructing such a
system.

3.  Identify open problems and tasks for future research.


2.1  Current  Practice 

A common method of identifying failed network connections is to use the tool traceroute, which
tracks the path that packets take to the destination. Traceroute works well but has some
important limitations:

1.  Traceroute cannot detect intermittent problems. It only determines if a connection can be
made right now.

2.  Not all routers provide the information that traceroute needs. E.g., some fail to observe
“time to live” or do not respond at all (causing traceroute to time out).

3.  Traceroute does not tell if a specific service is available but only if the nodes respond to
ICMP.

4.  If the destination node is a popular site (e.g., CNN), multiple requesting sites all
performing traceroute will result in heavy bandwidth use.

5.  There is no extension path from traceroute to more sophisticated models that would allow
reasoning about other network characteristics. Thus, it would be difficult to extend a
traceroute system to serve requests like “Schedule my download of file X when the
connection to node Y is fastest”.


3.  Methods,  Assumptions,  and  Procedures 

We assume that the network is instrumented with think-points (TP) [2] at major hubs
(autonomous systems) and end user stations. Each TP is a self-contained reasoning unit that can
observe network traffic through that point. Furthermore, when a node makes a traffic request
that fails to reach its destination (no return information), each TP can generate an error message
that is propagated to other nodes in the network.

In  general  terms,  our  framework  for  detecting  network  failures  works  as  follows: 

•  Each  TP  learns  a  model  T  of  its  traffic. 

•  Each  TP  learns  a  model  E  of  recent  failed  requests. 
•  Given  T  and  E  the  TP  infers  the  state  of  the  network.


In this report, we will assume that the only failure mode is a site going completely offline, in
which case it can neither receive nor forward messages. Each TP develops a hierarchical model
of its own traffic and errors. When a connection error occurs, the source node propagates an error
message to TPs along the same routing path as the initial request. By examining the traffic and
errors, each node can reason about the network state.

We  discuss  an  example  which  shows  how  the  system  might  work  at  a  general  level  and  then  we 
describe  the  components  in  greater  detail. 


3.1  Example  Problem

We illustrate our ideas with an example problem. Consider a system composed of four
autonomous systems (AS): A, B, C, and D, as shown in Figure 1. Each AS is connected to a
variety of lower level nodes (e.g., a1, a2, a3, etc.); the lower nodes for B and D are not shown for
clarity. We will use uppercase letters for autonomous systems and lowercase for end
nodes. Each AS may have many associated end nodes. Finally, each AS has an associated TP
which can observe and reason from its observations.


Figure  1:  Example  Network 


Suppose the node a1 makes a request of node c1, which under working conditions would be
routed along the path {a1, A, B, C, c1}. However, there is an error in the network and a1
receives no response from c1. Once a1 determines that its attempt to connect to c1 has failed
(i.e., no response after a fixed time limit), the TP at node a1 generates an error report and
sends it along the same route as the request. Note that both the traffic request and the error message
will be forwarded along the routing path until they encounter a failed node. Each node on the path
takes note of a1's trouble report and updates its own statistics about the nodes that have
connection errors.
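
The forwarding behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `propagate_error_report`, the dictionary layout of `error_counts`, and the use of a set of offline nodes are hypothetical and are not taken from the system described in this report.

```python
def propagate_error_report(path, failed, error_counts):
    """Forward an error report hop by hop along `path` (hypothetical sketch).

    path: ordered list of node names from the source to the destination.
    failed: set of offline nodes (they can neither receive nor forward).
    error_counts: dict mapping node -> {destination: reports seen}.
    Returns the list of nodes that recorded the report.
    """
    destination = path[-1]
    reached = []
    for node in path:
        if node in failed:
            break  # the report stops at the first failed node
        per_node = error_counts.setdefault(node, {})
        per_node[destination] = per_node.get(destination, 0) + 1
        reached.append(node)
    return reached


counts = {}
reached = propagate_error_report(["a1", "A", "B", "C", "c1"], {"C"}, counts)
print(reached)  # nodes that saw the report before it hit the failed hub
```

With hub C offline, only a1, A, and B record the report about destination c1; nodes beyond the failure never see it.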

The network outage could be at several locations along the path from a1 to c1, and the possible
candidates are a1, A, B, C, or c1. Based on T and E, the TP at a1 can infer the likely location of
the fault with the following type of reasoning.

•  If a1 does a lot of traffic with c2, and there are no failures a1-c2, then a1 can conclude that
the outage is at c1. To reach c2 the path {a1, A, B, C} must be operational and thus the
error has to be at c1.

•  If a1 also experiences many other failed requests to nodes such as B or D, then the error
may be closer in the network (e.g., B).

•  If a1 cannot communicate with any other node, then the error is probably with a1 itself.
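
The three rules above amount to eliminating candidates that recent successful traffic has shown to be working. The sketch below illustrates that elimination in Python. It is a deliberately simplified, deterministic stand-in for the probabilistic inference used in this report, and it assumes the reasoning node knows the routing path of each request, which is an assumption made only for this illustration. The function names are hypothetical.

```python
def candidate_faults(failed_path, successful_paths):
    """Nodes on the failed path that no successful request has passed through."""
    proven_working = set()
    for path in successful_paths:
        proven_working.update(path)
    return [node for node in failed_path if node not in proven_working]


def likely_fault(failed_path, successful_paths):
    """Heuristic (assumed): blame the remaining candidate nearest the source."""
    remaining = candidate_faults(failed_path, successful_paths)
    return remaining[0] if remaining else None


failed = ["a1", "A", "B", "C", "c1"]

# Rule 1: traffic to c2 succeeds, so a1, A, B, and C work and c1 is blamed.
print(likely_fault(failed, [["a1", "A", "B", "C", "c2"]]))

# Rule 2: only local traffic succeeds, so the fault is closer (here B).
print(likely_fault(failed, [["a1", "A", "a2"]]))

# Rule 3: nothing succeeds, so a1 itself is the first suspect.
print(likely_fault(failed, []))
```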

The TP at A also repeats this analysis process. However, A processes much more traffic as it
routes messages from a2 and a3 in addition to a1. Thus A should be able to better distinguish the
failure cause since it has more observations from which to make an inference. Likewise, the
other nodes that receive the error message (i.e., B and D) also attempt to infer the error location.

In the next two sections, we discuss how the TPs can model their traffic and errors, and how they
can use this information to reason probabilistically about the possible faults.


3.2  Traffic  and  Error  Models 

We  model  the  traffic  and  errors  with  a  two-level  hierarchy.  Each  node  keeps  track  of  the  traffic 
going  to  any  destination  (end  node)  and  the  corresponding  AS.  We  assume  that  every  message 
sent  has  an  address  that  can  be  decoded  into  an  AS  and  end  node. 

The  TPs  keep  track  of  the  traffic  to  the  various  destinations  with  an  exponentially  weighted 
moving  average: 


$$T(n,t) = \lambda \, T(n,t-1) + (1 - \lambda) \, M(n,t) \qquad (1)$$

The function T(n,t) represents the estimated traffic demand to node n at time t and is a weighted
sum of the previous traffic estimate and M(n,t), the messages sent to n at time t. The variable $\lambda$ is a
decay constant and should be set between zero and one. For our experiments, 0.9 was used as
the value of $\lambda$. The exponentially weighted moving average has the advantage of being
extremely simple to update and requires recording only one number per destination.
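
As a worked illustration of the update rule in equation (1), the following Python sketch applies the moving average over a few time steps with the decay constant of 0.9 used in the experiments. The function name `update_traffic` and the sample message counts are hypothetical and chosen only for this example.

```python
DECAY = 0.9  # decay constant; 0.9 is the value reported for the experiments


def update_traffic(previous_estimate, messages_now, decay=DECAY):
    """One step of the exponentially weighted moving average in equation (1)."""
    return decay * previous_estimate + (1.0 - decay) * messages_now


estimate = 0.0
history = []
for messages in [10, 10, 10, 0, 0]:  # made-up message counts per time step
    estimate = update_traffic(estimate, messages)
    history.append(estimate)

print(history)
```

The estimate rises slowly while traffic is present and decays slowly once it stops, which is why a single number per destination is sufficient but also why changes are reflected with some delay.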

We store the traffic counts for different destinations in a hash structure, which allows fast access
and retrieval of counts. However, as a node makes requests to different destinations, the number
of counters needed may grow. To keep the data structure that stores the
traffic numbers from growing too large, addresses with small scores are deleted from the structure.
Thus, for example, a rare traffic destination would be quickly removed from the data structure.
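
A minimal Python sketch of such a pruned hash structure follows. The class name `TrafficTable` and the pruning threshold `prune_below` are assumptions made for illustration; the report does not state the threshold that was used.

```python
class TrafficTable:
    """Per-destination traffic estimates with pruning of small scores (sketch)."""

    def __init__(self, decay=0.9, prune_below=0.05):
        self.decay = decay
        self.prune_below = prune_below  # assumed threshold, not from the report
        self.estimates = {}  # destination -> moving-average estimate

    def step(self, messages):
        """Advance one time step; `messages` maps destination -> count sent now."""
        for dest in set(self.estimates) | set(messages):
            old = self.estimates.get(dest, 0.0)
            new = self.decay * old + (1.0 - self.decay) * messages.get(dest, 0)
            if new < self.prune_below:
                self.estimates.pop(dest, None)  # drop rarely used destinations
            else:
                self.estimates[dest] = new


table = TrafficTable()
table.step({"c1": 1, "c2": 10})  # c1 is contacted once, c2 regularly
seen_after_first_step = sorted(table.estimates)
for _ in range(10):
    table.step({"c2": 10})
print(seen_after_first_step, sorted(table.estimates))
```

After a handful of steps without traffic, the entry for the rarely used destination decays below the threshold and is removed, so the table size tracks the set of active destinations.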


3.3  Reasoning  about  errors 

We  treat  the  reasoning  problem  as  a  classification  problem  where  the  classes  correspond  to  the 
possible  failure  states  of  the  network.  Each  TP  maintains  a  structure  of  traffic  and  errors,  and 
from  these  can  infer  potential  outages  in  the  network.  The  TP  then  selects  the  class  that  is  most 
probable  given  the  observations. 

We use a naive Bayesian classifier to perform the inference [1]. The naive Bayesian classifier is
based on Bayes' rule, which states


$$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{P(x)} \qquad (2)$$


i.e., the probability of a class c given an observation vector x is proportional to the probability of
x given c multiplied by the prior probability of c. The term x corresponds to the requests that were
either successes or failures and can be computed from T and E. The naive Bayesian classifier makes
the assumption that all observations are independent given the class and thus


$$P(x \mid c) = \prod_i P(x_i \mid c) \qquad (3)$$
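
To make the classification step concrete, the sketch below combines Bayes' rule with the independence assumption and picks the most probable class. It works in log space to avoid numerical underflow. The class names, priors, and per-observation likelihoods are invented for illustration and are not values from the experiments.

```python
import math


def posterior(priors, likelihoods):
    """Normalized P(c | x) from priors P(c) and per-observation P(x_i | c)."""
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
        for c in priors
    }
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Made-up numbers: three failed requests observed toward hub C.
priors = {"C_down": 0.1, "all_ok": 0.9}
likelihoods = {
    "C_down": [0.9, 0.9, 0.9],     # a failure is likely if C is down
    "all_ok": [0.05, 0.05, 0.05],  # a failure is unlikely if all is well
}
post = posterior(priors, likelihoods)
best = max(post, key=post.get)
print(best, post)
```

Even with a small prior on a failure, three consistent failed requests are enough to make the failure class the most probable one in this toy setting.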


For each AS and its associated end nodes we consider two cases. The first is that the AS itself is
broken. If a node is broken we expect that with probability $\delta$ a traffic request will result in an
error, and with probability $1 - \delta$ in no error. Note that $\delta$ is not set to zero because the exponentially weighted
moving average incorporates information from the recent history of traffic and thus even if the
node is now working correctly, the estimate may include past errors. For a hub failure we define

(4) 

i 


Where  t  and  e  represent  the  total  counts  from  the  traffic  models  for  the  hub  of  interest. 

The second case we consider is that some individual end nodes are broken but the AS is fine.
We model the joint probability of the individual end nodes generating the observations as the
product of the probabilities for those individual nodes that are hypothesized to be working or
experiencing a failure.

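$$P(x \mid c) = \prod_{i \in \text{failed}} \delta^{e_i} (1 - \delta)^{t_i - e_i} \prod_{i \in \text{working}} \gamma^{e_i} (1 - \gamma)^{t_i - e_i} \qquad (5)$$

where $t_i$ and $e_i$ represent the total counts to each end node i, and $\gamma$ is the probability of observing
an error for a working node.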
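
The following Python sketch shows how competing hypotheses about failed end nodes could be scored from per-node counts. It rests on several assumptions that go beyond the text: the traffic count for a node is taken to include both successful and failed requests, the two error probabilities are given made-up values (0.9 for a failed node, 0.05 for a working node), and the names `log_likelihood` and `score_hypothesis` are hypothetical.

```python
import math

P_ERR_FAILED = 0.9    # assumed probability of an error at a failed node
P_ERR_WORKING = 0.05  # assumed probability of an error at a working node


def log_likelihood(requests, errors, p_error):
    """Log-probability of `errors` failures among `requests` independent tries."""
    return errors * math.log(p_error) + (requests - errors) * math.log(1.0 - p_error)


def score_hypothesis(counts, failed_nodes):
    """Sum per-node log-likelihoods; `counts` maps node -> (requests, errors)."""
    total = 0.0
    for node, (requests, errors) in counts.items():
        p = P_ERR_FAILED if node in failed_nodes else P_ERR_WORKING
        total += log_likelihood(requests, errors, p)
    return total


counts = {"c1": (20, 18), "c2": (30, 1)}  # made-up observations
hypotheses = [frozenset(), frozenset({"c1"}), frozenset({"c1", "c2"})]
scores = {h: score_hypothesis(counts, h) for h in hypotheses}
best = max(scores, key=scores.get)
print(sorted(best), scores[best])
```

With these counts the hypothesis that only c1 has failed scores highest, since nearly all requests to c1 failed while nearly all requests to c2 succeeded.
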

4.  Results  and  Discussion


In this section, we discuss experiments with our system in detecting network errors. We note
that we are most concerned with discovering qualitative behaviors of the system and not detailed
quantitative comparisons.

We  implemented  our  system  in  C++  and  ran  experiments  under  UNIX.  We  generated  random 
networks  that  varied  the  number  of  AS  and  end  nodes.  Additionally,  traffic  was  generated  by 
randomly  selecting  the  origin  and  destination  according  to  a  non-uniform  distribution. 
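
A rough Python sketch of this kind of network and traffic generation is shown below. The original simulator was written in C++ and its details are not given here, so everything in the sketch is an assumption: the naming scheme for nodes, the Zipf-like weighting used as the non-uniform distribution, and the function names `build_network` and `generate_traffic`.

```python
import random


def build_network(num_as, nodes_per_as):
    """Map each AS name to the list of its end nodes (hypothetical naming)."""
    return {
        f"AS{i}": [f"as{i}_n{j}" for j in range(nodes_per_as)]
        for i in range(num_as)
    }


def generate_traffic(network, num_requests, rng):
    """Draw (origin, destination) pairs from a non-uniform distribution."""
    nodes = [n for members in network.values() for n in members]
    # Zipf-like weights: an assumed choice of non-uniform distribution.
    weights = [1.0 / (rank + 1) for rank in range(len(nodes))]
    requests = []
    while len(requests) < num_requests:
        origin, destination = rng.choices(nodes, weights=weights, k=2)
        if origin != destination:
            requests.append((origin, destination))
    return requests


network = build_network(num_as=4, nodes_per_as=10)
requests = generate_traffic(network, 100, random.Random(0))
print(len(network), len(requests))
```

Seeding the generator makes a run repeatable, which is convenient when comparing how different nodes diagnose the same injected fault.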

We experimented with several fixed structures, as in Figure 1, as well as randomly generated
networks into which we inserted faults. The networks ranged in size from a small system with 4 AS
and approximately 10 end nodes per AS to 100 AS, each with 100 end nodes.

We present here the results of a simulation run on the network of Figure 1 where a network error was injected at
node C. Figure 2 shows the inferred errors for the TPs located at nodes B, b3, and A. From the
graph, we draw the following conclusions.


[Figure 2 plot: timelines labeled True, B, b3, and A along a horizontal time axis.]

Figure 2: Simulation trace with a network error. The timelines show the inferred error by the TPs at points
B, b3, and A.


•  All nodes were able to detect a failure in node C.

•  There is a lag in detecting the failure. This is a direct result of using a moving average to
store traffic and error information.

•  Each TP has good statistics for nodes “close” to it and can therefore diagnose problems at those
nodes more accurately. For instance, the TP at B performed much better than b3 or A.


We ran several simulations varying the size of the network, as well as the errors that could occur.
In all cases the extra overhead for sending error messages was small, typically less than
10% of the total traffic. The memory overhead for storing the traffic and error
models was large but fixed and did not grow as time passed. This cost is mitigated by
the fact that it is distributed across many nodes in the network.


5.  Limitations  and  Future  Research 

Our experiments demonstrated that distributed learning of network failures was feasible. In this
section, we discuss some of the approach's limitations and directions for future research.

•  Incorporating state information. Our representation in section 3.1 was stateless and did
not consider the time ordering of examples. The stateless representation means that it
could be represented efficiently, but the drawback was that there was some lag in
identifying failures. Storing the complete time history of traffic and errors could alleviate
this problem but obviously requires greater memory. An important task then would be to
selectively remember certain facts.

•  Developing more sophisticated models of internet performance and classes of failures.
Our models only tracked successful and failed connections between nodes in the network.
There would be much benefit to tracking other characteristics such as latency or
bandwidth. This would require more sophisticated models, but would also allow
inference on more complex queries.

•  Extending the available actions to the TP. Currently, the TPs can only propagate error
messages. Other actions might request another node's diagnosis of the problem.

•  Incorporating knowledge of topology into reasoning processes. We assumed nodes could
only tell which AS and end nodes were associated, but not how the ASs are connected.
Knowledge of the topology would let more complex hypotheses about network failures be
considered.


6.  Conclusions 

We constructed a demonstration system and showed that distributed diagnosis of network faults
was feasible. In the distributed system, each end node becomes a detector of network faults and
can signal when an error has occurred. Experiments with a simulator showed that faults could be
detected with a limited overhead for error messages and memory. Future work should focus on
the models used by each node to model the traffic and errors that occur.